Inversion method for content-based networks 
In this paper, we generalize a recently introduced Expectation Maximization (EM) method for 
graphs and apply it to content-based networks. The EM method provides a classification of the nodes 
of a graph, and allows one to infer relations between the different classes. Content-based networks are 
ideal models for graphs displaying any kind of community and/or multipartite structure. We show 
both numerically and analytically that the generalized EM method is able to recover the process 
that led to the generation of such networks. We also investigate the conditions under which our 
generalized EM method can recover the underlying contents-based structure in the presence of 
randomness in the connections. Two entropies, $S_q$ and $S_c$, are defined to measure the quality of the 
node classification and to what extent the connectivity of a given network is content-based. $S_q$ and 
$S_c$ are also useful in determining the number of classes for which the classification is optimal. 

PACS numbers: 89.75.Hc, 02.50.Tt 



I. INTRODUCTION 



Classifying items with respect to their properties is 
a fundamental and very old problem. If the proper- 
ties are inherent to the objects, the difficulty is deciding 
first how many groups are required and then establish- 
ing the discrimination thresholds for each. The matter 
becomes more complicated when instead of the inherent 
properties, one tries to classify elements based on mu- 
tual interactions. Of course, such classifications would 
be very useful for a better understanding of the mech- 
anisms underlying the behavior of systems encountered 
in scientific disciplines as diverse as Sociology, Biology or 
Physics [H, H, H, 13|. As an example, consider social sys- 
tems which are often modeled as networks. The vertices 
represent individuals and the edges interactions between 
them. These interactions can be of many types: friend- 
ship, belonging to the same club or school, working to- 
gether, etc. In these graphs, it is important to be able to 
group the nodes into what is commonly known as com- 
munities. That is, groups of vertices that share a higher 
number of connections among themselves than with the 
rest of the network SSSSl (see also for a recent 
review). This partition bears information on which per- 
sons have a stronger interdependence and may allow one to 
predict the actors that drive the dynamics of the group as 
a whole. In Biology, on the other hand, network methods 
have been used to understand gene regulatory patterns 
[TH . Here each vertex corresponds to a gene and an edge 
contains information on how the associated protein regu- 
lates the synthesis of the protein associated to the second 
gene. Since regulation of gene activity plays a fundamen- 
tal role in the functioning of the cell [12], the community 
structure points towards the different functional subunits 
(see [13] and references therein). Given the relevance of 
communities, recent years have seen an increase in the 
number of techniques proposed to detect them. To name 
a few: some of them are based on the concept of be- 
tweenness (number of paths passing through a link) and 
modularity B , 0, [3] , others on synchronization of oscil- 
lators [15, 16] or on other dynamical systems running on 
the network [17, 18, 19], detection of overlapping cliques 
[20] or the diffusion of random walkers [2l|, 0, HH. 

Nevertheless, communities are not the only relevant 
information that can be extracted from networks. It is 
also possible to search for vertices with similar connec- 
tion patterns (not necessarily having connections among 
themselves, as in the case of communities) that are ex- 
pected to play equivalent functional roles. In the social 
networks literature such nodes are referred to as struc- 
turally equivalent [24] and have led to an analysis of so- 
cial networks based on Block Modeling 0, H|| . In many 
types of networks, like those formed by webpages or so- 
cial actors, the connection between nodes is often due to 
some intrinsic properties of the nodes, which we will refer 
to henceforth as their "contents". Thus it is possible to 
consider an alternative point of view in which a network 
structure arises as a result of node contents, leading one 
to the notion of contents-based networks [26, 27, 28, 29]. 

In many cases, network analysis approaches based on 
communities and those based on some form of node sim- 
ilarity are aimed towards the understanding of very dif- 
ferent questions. When viewed within the framework of 
contents based networks, however, these differences dis- 
appear as will be argued below. We will also show that 
an extension of Newman and Leicht's Expectation Max- 
imization (EM) method [30] is well-suited for uncovering 
content-based structure underlying a network, inverting 
in practice the process that led to its formation. 

The organization of the paper is as follows: in Section 
II, content based networks are formally introduced. Next, 
we describe in Section III our generalization of the EM 
method to directed graphs. In Section IV, we show how 
the EM method can be used to solve the inverse problem, 
namely to recover the underlying contents-based struc- 
ture from a given network. We present in Section V 
analytical results regarding the application of the EM 
method to contents-based networks and the recovery of 
the contents-based structure. These results will be com- 
plemented with a numerical study in Sections VI and 
VII. In Section VII, we consider a more realistic situa- 
tion and ask to what extent an underlying contents-based 
structure can be recovered in the presence of disorder in 
the connections. Finally, we summarize our results and 
present the conclusions in Section VIII. 



II. CONTENT BASED NETWORKS 

Let us first define content-based networks. Consider a 
set of nodes $i = 1, 2, \ldots, N$, each of which is assigned a content 
$x_i \in X = \{1, 2, \ldots, \mathcal{N}_x\}$, where $1, 2, \ldots$ 
are labels for the possible contents. The structure of 
the connectivity pattern of the associated content-based 
network is determined by the function $c(x, y) \in \{0, 1\}$, 
which is defined for all ordered pairs of contents $(x, y) \in X \times X$. 
The adjacency matrix of the graph is then given by 

$$A_{ij} = c(x_i, x_j). \qquad (1)$$



We see immediately that nodes having the same con- 
tents x also have the same connection patterns, and thus 
are structurally equivalent. As explained before, this 
can imply a functional equivalence in the process that 
generated the network. The point of view that we will 
take in this article is to regard contents-based networks 
as ideal networks, from which the "real" networks are 
obtained through alteration or removal of some of the 
connections. Note that the range of topologies that can 
be generated via contents-based networks is very broad: 
if the connectivity function $c(x, y)$ shows a close to di- 
agonal configuration, the network will be formed by a 
set of almost isolated cliques. The ideal configuration 
would be a family of independent communities without 
interconnections. Another configuration that can be eas- 
ily reproduced with content-based networks is the bipartite 
graph. In its simplest form, it is enough to allow 
the nodes to take one of two possible contents and to let 
the connectivity function $c$ be non-zero only for the 
off-diagonal elements. Much more complicated connec- 
tivity patterns can be achieved by introducing 
finer contents distinctions and more intricate connectiv- 
ity functions. Thus a content-based graph can in general 
include all sorts of combinations of community-like 
and/or multipartite structures, as can be seen in the example 
plotted in Figure 1. 

FIG. 1: An example of a content-based network. The colors 
correspond to the different contents (green A, red B, blue C, 
magenta D, cyan E, olive F and orange G). 

Another point to note is that originally these networks 
were proposed in a context where the relation between 
contents was an order relation [26, 31, HJ]. This implies 
that the relation between nodes is not symmetric and the 
network is therefore more naturally represented by a di- 
rected graph. In this case, the connectivity function $c$ is 
non-symmetric in its arguments. Apart from directional- 
ity, realistic graphs may also present a certain degree 
of disorder in their connection patterns. This effect can 
be incorporated into the mathematical description by re- 
garding the values of $c(x, y)$ as probabilities of having a 
link from a node of content $x$ to a node of content $y$. This 
view transforms the content-based network into a hidden 
variable graph [jjj H3, HH]. As we will see later, the EM 
method is still able to extract information from networks 
produced in this way, but the failure rate increases the 
further $c(x, y)$ deviates from a binary-valued function. 
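
As an illustration of the construction described in this section, the following sketch samples a directed network in which a link from node i to node j is present with probability c(x_i, x_j); a binary-valued c then reproduces the ideal content-based network. The function name, the 0-based content labels and the use of NumPy are assumptions made for this example only.

```python
import numpy as np

def content_based_network(contents, c, rng=None):
    """Sample A with A[i, j] ~ Bernoulli(c[x_i, x_j]) and no self-connections."""
    rng = np.random.default_rng() if rng is None else rng
    contents = np.asarray(contents)
    c = np.asarray(c, dtype=float)
    # The link probability depends only on the contents of the two nodes.
    p = c[np.ix_(contents, contents)]
    A = (rng.random(p.shape) < p).astype(int)
    np.fill_diagonal(A, 0)  # convention: A_ii = 0
    return A

# Bipartite case: two contents, links only between different contents.
c_bipartite = np.array([[0, 1],
                        [1, 0]])
x = np.array([0, 0, 1, 1, 1])
A = content_based_network(x, c_bipartite, rng=np.random.default_rng(0))
```

With a binary c the result is deterministic; replacing the ones by, say, 0.9 introduces the kind of disorder discussed above.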

Contents based networks have proven to be very use- 
ful in the description of phenomena that include an un- 
derlying relation of hierarchy or ordering. The simplest 
way of achieving such a relation is to associate with each 
node a string of letters and to let the relation between 
any two nodes be based on string inclusion: namely 
that one string is contained as an uninterrupted subse- 
quence in the other. Networks generated from random 
strings in this manner have been successfully used to 
model receptor-ligand interactions in the immune system 
[31, 32], and the transcription factor based gene regula- 
tory network in yeast Hfl [S3, [H, HU . 
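
The string-inclusion rule can be sketched in a few lines. The direction of the edge (from the contained string to the containing one), the two-letter alphabet and the helper names are assumptions chosen for illustration; the models cited above may use different conventions.

```python
import random

def string_inclusion_network(strings):
    """Return directed edges (i, j) whenever strings[i] occurs inside strings[j]."""
    edges = []
    for i, s in enumerate(strings):
        for j, t in enumerate(strings):
            # `s in t` tests for a contiguous (uninterrupted) occurrence.
            if i != j and s in t:
                edges.append((i, j))
    return edges

def random_strings(n, alphabet="ab", max_len=4, seed=0):
    """Draw n random strings with lengths between 1 and max_len."""
    rng = random.Random(seed)
    return ["".join(rng.choice(alphabet) for _ in range(rng.randint(1, max_len)))
            for _ in range(n)]

edges = string_inclusion_network(["ab", "abba", "b", "ba"])
```

Here the string "b" is contained in all the others, so node 2 links to every other node, which is the hierarchical pattern that this kind of model is meant to capture.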

In this article, our goal is to address the inverse prob- 
lem: Given a network of which we know nothing in ad- 
vance, is it possible to decide whether there is an underly- 
ing contents-based structure and, if so, can we deduce the 
class membership of its nodes and the class connectivity 
function? Moreover, can this be achieved in the presence 
of noisy connections? Seen in this way, the problem at 
hand becomes one of statistical inference, very well-suited 
to EM methods SUS- 



III. THE EM METHOD FOR NETWORKS AND 
ITS GENERALIZATION 

Given a graph $\mathcal{G}$ of $N$ nodes and an adjacency ma- 
trix $A_{ij}$, the Expectation Maximization (EM) algorithm 
[30] searches for a partition of the nodes into $\mathcal{N}_c$ groups 
such that a certain log-likelihood function for the graph 
is maximized. Henceforth we will refer to the groups into 
which the EM method divides the nodes as classes. Note 
that $\mathcal{N}_c$ must not be confused with the number of con- 
tents $\mathcal{N}_x$ described in the previous section. Ideally, the 
optimal number of classes would be $\mathcal{N}_x$, but a criterion 
independent of the EM algorithm is required to deter- 
mine its value, since we assume that in general $\mathcal{N}_x$ will 
not be known in advance. The variables of the algorithm 
are the probabilities $\pi_r$ that a randomly selected node is 
assigned to class $r$, with $r = 1, 2, \ldots, \mathcal{N}_c$, and the set of 
probabilities $\theta_{rj}$ of having a connection from a node be- 
longing to class $r$ to a certain node $j$. Assuming that the 
functions $\theta$ and $\pi$ are given, the probability $\Pr(A, g \mid \pi, \theta)$ 
of realizing the given graph under a node classification $g$, 
such that $g_i$ is the class that node $i$ has been assigned to, 
can be written as 



$$\Pr(A, g \mid \pi, \theta) = \prod_i \pi_{g_i} \prod_j \theta_{g_i, j}^{A_{ij}}. \qquad (2)$$



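$\Pr(A, g \mid \pi, \theta)$ is the likelihood to be maximized, but it 
turns out to be more convenient to consider its logarithm 
instead, 

$$\mathcal{L}(\pi, \theta) = \sum_i \Big[ \ln \pi_{g_i} + \sum_j A_{ij} \ln \theta_{g_i, j} \Big]. \qquad (3)$$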



Treating the a priori unknown class assignment $g_i$ of the 
nodes as statistical "unknown data", one next introduces 
the auxiliary probabilities $q_{ir} = \Pr(g_i = r \mid A, \pi, \theta)$ that a node 
$i$ is assigned to class $r$, and considers the averaged log- 
likelihood constructed as 

$$\bar{\mathcal{L}}(\pi, \theta) = \sum_{i,r} q_{ir} \Big[ \ln \pi_r + \sum_j A_{ij} \ln \theta_{rj} \Big]. \qquad (4)$$



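The maximization of $\bar{\mathcal{L}}$ must be performed taking into 
account the following normalization conditions for the 
probabilities $\pi$ and $\theta$: 

$$\sum_{r=1}^{\mathcal{N}_c} \pi_r = 1, \qquad (5)$$

$$\sum_{j=1}^{N} \theta_{rj} = 1. \qquad (6)$$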

The final results are 

$$\pi_r = \frac{1}{N} \sum_i q_{ir}, \qquad (7)$$

$$\theta_{rj} = \frac{\sum_i A_{ij}\, q_{ir}}{\sum_i k_i\, q_{ir}}, \qquad (8)$$

where $k_i$ is the total degree of node $i$. The still unknown 
probabilities $q_{ir}$ are then determined a posteriori by not- 
ing that 

$$q_{ir} = \Pr(g_i = r \mid A, \pi, \theta) = \frac{\Pr(A, g_i = r \mid \pi, \theta)}{\Pr(A \mid \pi, \theta)}, \qquad (9)$$

from which one obtains 

$$q_{ir} = \frac{\pi_r \prod_j \theta_{rj}^{A_{ij}}}{\sum_s \pi_s \prod_j \theta_{sj}^{A_{ij}}}. \qquad (10)$$

FIG. 2: A simple scenario in which the EM method for di- 
rected networks, as defined in [30], has problems classifying 
the nodes of the network into two classes. The configurations a) 
and b) are possible outputs of the original EM method since 
both satisfy the normalization condition of Eq. (6). The so- 
lution a) comes together with values $q_{ir} = 1/2$ for all the 
nodes and classes, while the solution b), which has a lower 
likelihood, produces $q_{ir} \approx 0.99$ for all the nodes in one class 
and a very small probability for the other. The plot on the 
right, solution c), is the output offered by the generalization 
of EM, with values of $q_{ir}$ virtually one and/or zero. 

Eqs. (7), (8) and (10) form a set of self-consistent equa- 
tions for $q_{ir}$, $\theta_{rj}$ and $\pi_r$ that any extremum of the ex- 
pected log-likelihood must satisfy. 

Thus, given a graph $\mathcal{G}$, the EM algorithm consists of 
picking a number of classes $\mathcal{N}_c$ into which the nodes are 
to be classified and searching for solutions of Eqs. (7), 
(8) and (10). These equations were derived by Newman 
and Leicht [30]. They also showed that, when applied to 
diverse types of networks, the resulting $q_{ir}$ and $\theta_{rj}$ yield 
useful information about the internal structure of the net- 
work. Note that only a minimal amount of a priori in- 
formation is supplied: the number of classes $\mathcal{N}_c$ and the 
network. 
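
A single sweep of these self-consistent equations can be sketched as follows. This is a minimal illustration and not the authors' implementation: it assumes that $k_i = \sum_j A_{ij}$, that every class carries some weight on nodes with at least one link, and it introduces a small floor `TINY` (an implementation choice) so that the logarithm of a vanishing $\theta_{rj}$ stays finite.

```python
import numpy as np

TINY = 1e-300  # floor that keeps log(0) finite; an implementation choice

def em_step(A, q):
    """One update of pi (Eq. 7), theta (Eq. 8) and q (Eq. 10); q has shape (N, Nc)."""
    A = np.asarray(A, dtype=float)
    N = A.shape[0]
    k = A.sum(axis=1)                           # k_i = sum_j A_ij
    pi = q.sum(axis=0) / N                      # Eq. (7)
    theta = (q.T @ A) / (q.T @ k)[:, None]      # Eq. (8), shape (Nc, N)
    # Eq. (10), evaluated in log space to avoid underflow of the product over j.
    log_q = np.log(np.maximum(pi, TINY)) + A @ np.log(np.maximum(theta, TINY)).T
    log_q -= log_q.max(axis=1, keepdims=True)
    q_new = np.exp(log_q)
    q_new /= q_new.sum(axis=1, keepdims=True)
    return pi, theta, q_new

# Two disjoint triangles: the sharp two-class assignment is a fixed point.
A = np.kron(np.eye(2), np.ones((3, 3)) - np.eye(3))
q_sharp = np.repeat(np.eye(2), 3, axis=0)
pi, theta, q_new = em_step(A, q_sharp)
```

In practice one would start from a random q and iterate `em_step` until q stops changing, possibly from several initial conditions, since the equations can have more than one solution.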

However, the EM method in the form presented so far 
does not yet serve our purposes for the following reason: 
as remarked before, content-based networks are usually 
represented as directed graphs. The probability $\theta_{rj}$ was 
defined as the probability that a node $j$ receives a di- 
rected connection from a node belonging to class $r$. To- 
gether with the normalization condition for $\theta_{rj}$, Eq. (6), 
this implies that the classification must be such that each 
class $r$ has at least one member with non-zero out-degree. 
This constraint forces the EM algorithm to classify a sim- 
ple bipartite graph in the manner depicted in Figures 2a 
or 2b. From a content-based point of view, on the other 
hand, the more natural classification is 
the one displayed in Figure 2c, which is forbidden by the 
condition of Eq. (6). This difficulty is not resolved by 
re-defining $\theta_{rj}$ instead as the probability that a node $j$ 
makes a directed connection to a node belonging to class 
$r$, since then the classification must be such that each 
class $r$ has at least one member with non-zero in-degree. 

We therefore have to generalize the EM approach in 
such a way that the node directionality does not re- 
strict the possible classification of the nodes. This can 
be achieved by introducing the probabilities 

• $\overrightarrow{\theta}_{ri}$ of having a uni-directional link from a vertex 
of class $r$ to a node $i$, 

• $\overleftarrow{\theta}_{ri}$ of having a uni-directional link from node $i$ to 
a node in class $r$, and 

• $\overleftrightarrow{\theta}_{ri}$ of having a bidirectional link between $i$ and a 
node in class $r$. 



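With these new definitions Eq. (2) becomes 

$$\Pr(A, g \mid \pi, \overleftarrow{\theta}, \overrightarrow{\theta}, \overleftrightarrow{\theta}) = \prod_i \pi_{g_i} \prod_j \overleftarrow{\theta}_{g_i, j}^{\,A_{ji}(1 - A_{ij})}\; \overrightarrow{\theta}_{g_i, j}^{\,A_{ij}(1 - A_{ji})}\; \overleftrightarrow{\theta}_{g_i, j}^{\,A_{ij} A_{ji}}. \qquad (11)$$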



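The expected log-likelihood can now be written as 

$$\bar{\mathcal{L}}(\pi, \theta) = \sum_{i,r} q_{ir} \Big\{ \ln \pi_r + \sum_j \Big[ A_{ji}(1 - A_{ij}) \ln \overleftarrow{\theta}_{rj} + A_{ij}(1 - A_{ji}) \ln \overrightarrow{\theta}_{rj} + A_{ij} A_{ji} \ln \overleftrightarrow{\theta}_{rj} \Big] \Big\}, \qquad (12)$$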

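which has to be maximized under the following constraint 
on the probabilities $\theta_{rj}$, 

$$\sum_i \left( \overleftarrow{\theta}_{ri} + \overrightarrow{\theta}_{ri} + \overleftrightarrow{\theta}_{ri} \right) = 1, \qquad (13)$$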



implying that there is no isolated node. The probability 
7r r , that a randomly selected node belongs to class r, is 
again given by Eq. 0. 

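Introducing the Lagrange multipliers $\beta$ and $\lambda_r$ to in- 
corporate the constraints, Eqs. (5) and (13), the expres- 
sion to be extremized becomes 

$$\tilde{\mathcal{L}} = \bar{\mathcal{L}} + \beta \Big[ 1 - \sum_r \pi_r \Big] + \sum_r \lambda_r \Big[ 1 - \sum_j \left( \overleftarrow{\theta}_{rj} + \overrightarrow{\theta}_{rj} + \overleftrightarrow{\theta}_{rj} \right) \Big]. \qquad (14)$$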



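As before, the extremal condition on $\tilde{\mathcal{L}}$ with respect to $\pi$ 
gives us 

$$\frac{\partial \tilde{\mathcal{L}}}{\partial \pi_r} = 0 \;\Rightarrow\; \pi_r = \frac{1}{N} \sum_i q_{ir} \quad \text{and} \quad \beta = N, \qquad (15)$$

where $N$ is the total number of nodes. Differentiating $\tilde{\mathcal{L}}$ 
with respect to the $\theta$ variables, we get [38] 

$$\frac{\partial \tilde{\mathcal{L}}}{\partial \overleftarrow{\theta}_{rj}} = 0 \;\Leftrightarrow\; \sum_i q_{ir} A_{ji}(1 - A_{ij}) - \overleftarrow{\theta}_{rj}\, \lambda_r = 0,$$

$$\frac{\partial \tilde{\mathcal{L}}}{\partial \overrightarrow{\theta}_{rj}} = 0 \;\Leftrightarrow\; \sum_i q_{ir} A_{ij}(1 - A_{ji}) - \overrightarrow{\theta}_{rj}\, \lambda_r = 0,$$

$$\frac{\partial \tilde{\mathcal{L}}}{\partial \overleftrightarrow{\theta}_{rj}} = 0 \;\Leftrightarrow\; \sum_i q_{ir} A_{ij} A_{ji} - \overleftrightarrow{\theta}_{rj}\, \lambda_r = 0. \qquad (16)$$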

Putting together the three previous expressions and sum- 
ming over the node index $j$, we obtain the follow- 
ing result for the Lagrange multipliers, 

$$\lambda_r = \sum_i q_{ir} \left( k_i^{\mathrm{i}} + k_i^{\mathrm{o}} - k_i^{\mathrm{b}} \right), \qquad (17)$$

where $k_i^{\mathrm{i}}$, $k_i^{\mathrm{o}}$ and $k_i^{\mathrm{b}}$ are the in-degree, out-degree and bi- 
directional degree of node $i$, respectively. Inserting this 
relation into the previous set of equations, we extract the 
new extremal conditions for the $\theta$'s, 



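$$\overleftarrow{\theta}_{rj} = \frac{\sum_i q_{ir} A_{ji}(1 - A_{ij})}{\sum_i q_{ir} \left( k_i^{\mathrm{i}} + k_i^{\mathrm{o}} - k_i^{\mathrm{b}} \right)},$$

$$\overrightarrow{\theta}_{rj} = \frac{\sum_i q_{ir} A_{ij}(1 - A_{ji})}{\sum_i q_{ir} \left( k_i^{\mathrm{i}} + k_i^{\mathrm{o}} - k_i^{\mathrm{b}} \right)},$$

$$\overleftrightarrow{\theta}_{rj} = \frac{\sum_i q_{ir} A_{ij} A_{ji}}{\sum_i q_{ir} \left( k_i^{\mathrm{i}} + k_i^{\mathrm{o}} - k_i^{\mathrm{b}} \right)}. \qquad (18)$$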



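These expressions have to be again supplemented with 
the self-consistent equation for $q_{ir}$, which now reads 

$$q_{ir} = \frac{\pi_r \prod_j \overleftarrow{\theta}_{rj}^{\,A_{ji}(1 - A_{ij})}\, \overrightarrow{\theta}_{rj}^{\,A_{ij}(1 - A_{ji})}\, \overleftrightarrow{\theta}_{rj}^{\,A_{ij} A_{ji}}}{\sum_s \pi_s \prod_j \overleftarrow{\theta}_{sj}^{\,A_{ji}(1 - A_{ij})}\, \overrightarrow{\theta}_{sj}^{\,A_{ij}(1 - A_{ji})}\, \overleftrightarrow{\theta}_{sj}^{\,A_{ij} A_{ji}}}. \qquad (19)$$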

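Note that when we have only bi-directional links, so 
that $A_{ij} = A_{ji}$, it follows from Eq. (18) that 
$\overleftarrow{\theta}_{rj} = \overrightarrow{\theta}_{rj} = 0$, while 
$k_j^{\mathrm{i}} = k_j^{\mathrm{o}} = k_j^{\mathrm{b}} = k_j$. Thus we recover the original EM equations 
under the identification $\overleftrightarrow{\theta}_{rj} = \theta_{rj}$. 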

It is easily shown that the solutions of the EM equa- 
tions, Eqs. (15), (18) and (19), are such that if two nodes 
$i$ and $j$ are structurally equivalent, i.e. $A_{ik} = A_{jk}$ as well 
as $A_{ki} = A_{kj}$ for all $k$, then they will be classified in the 
same manner: $q_{ir} = q_{jr}$, and $\overleftarrow{\theta}_{ri} = \overleftarrow{\theta}_{rj}$, 
$\overrightarrow{\theta}_{ri} = \overrightarrow{\theta}_{rj}$ and 
$\overleftrightarrow{\theta}_{ri} = \overleftrightarrow{\theta}_{rj}$ for all $r$. 
This property of the solutions ob- 
tained from the EM method renders it very well-suited 
for detecting any underlying contents-based structure. 
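
The generalized update can be sketched along the same lines as the original one. The code below is an illustrative reading of Eqs. (15), (17), (18) and (19), not the authors' implementation; the names `th_in`, `th_out` and `th_bi` stand for the three $\theta$ probabilities (links received by the class-$r$ node, links emitted by it, and bidirectional links), and the floor `TINY` is an implementation choice.

```python
import numpy as np

TINY = 1e-300  # floor that keeps log(0) finite; an implementation choice

def generalized_em_step(A, q):
    """One update of the generalized EM equations; q has shape (N, Nc)."""
    A = np.asarray(A, dtype=float)
    N = A.shape[0]
    into = A.T * (1 - A)   # into[i, j] = A_ji (1 - A_ij): link j -> i only
    out = A * (1 - A.T)    # out[i, j]  = A_ij (1 - A_ji): link i -> j only
    both = A * A.T         # both[i, j] = A_ij A_ji: bidirectional link
    lam = q.T @ (into + out + both).sum(axis=1)          # Eq. (17)
    pi = q.sum(axis=0) / N                               # Eq. (15)
    th_in, th_out, th_bi = ((q.T @ M) / lam[:, None]     # Eq. (18)
                            for M in (into, out, both))
    log_q = np.log(np.maximum(pi, TINY))
    for M, th in ((into, th_in), (out, th_out), (both, th_bi)):
        log_q = log_q + M @ np.log(np.maximum(th, TINY)).T
    log_q -= log_q.max(axis=1, keepdims=True)
    q_new = np.exp(log_q)
    q_new /= q_new.sum(axis=1, keepdims=True)            # Eq. (19)
    return pi, (th_in, th_out, th_bi), q_new

# Directed bipartite graph: nodes 0 and 1 each point to nodes 2 and 3.
A = np.zeros((4, 4))
A[:2, 2:] = 1
q_sharp = np.repeat(np.eye(2), 2, axis=0)   # sources in class 0, sinks in class 1
pi, thetas, q_new = generalized_em_step(A, q_sharp)
```

For this small graph the classification that separates sources from sinks is reproduced by the update, although the sink class has no member with non-zero out-degree.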
IV. THE INVERSION METHOD 

One important shortcoming of the EM method is that 
$\mathcal{N}_c$ has to be provided as an external parameter. The 
algorithm lacks a means to evaluate how good a clas- 
sification is, and consequently one cannot decide which 
number of classes furnishes an optimal classification of 
the nodes of a graph. To overcome this problem, we pro- 
pose to define a measure of the quality of a classification 
as follows: 

$$S_q = -\frac{1}{N} \sum_{i,r} q_{ir} \ln(q_{ir}), \qquad (20)$$

where the sum runs over all the nodes $i$ and classes $r$. 
$S_q$ is the average entropy of the classification and as such 
measures the certainty with which the nodes are assigned 
to their respective classes. One can easily see that 
$0 \le S_q \le \ln \mathcal{N}_c$. For a sharp classification $S_q = 0$, while the 
worst-case scenario occurs when $q_{ir} = 1/\mathcal{N}_c$. We will 
later argue that $S_q$ is a useful statistic to infer $\mathcal{N}_c$. 
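
The classification entropy is straightforward to evaluate from the matrix of probabilities $q_{ir}$. The sketch below assumes the average is taken over the $N$ nodes, which is the normalization that gives the stated bounds, and treats $0 \ln 0$ as zero; the function name is an assumption.

```python
import numpy as np

def classification_entropy(q):
    """S_q = -(1/N) sum_{i,r} q_ir ln q_ir for q of shape (N, Nc)."""
    q = np.asarray(q, dtype=float)
    # 0 * ln 0 is taken as 0, its limiting value.
    terms = np.where(q > 0, q * np.log(np.where(q > 0, q, 1.0)), 0.0)
    return -terms.sum() / q.shape[0]

s_sharp = classification_entropy(np.eye(3))              # every node in one class
s_flat = classification_entropy(np.full((10, 4), 0.25))  # q_ir = 1/Nc
```

The two calls reproduce the limiting cases mentioned in the text: zero for a sharp classification and $\ln \mathcal{N}_c$ for a completely undecided one.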

Once an optimal classification has been found, it is 
possible to determine the connectivity structure among 
the classes. Given an EM classification, we will define 
$\bar{c}(r, s)$ as the probability that a node in class $r$ has a 
connection to one in class $s$. This probability can be 
estimated as 



c(r, s) = ^ — I 1 



Si Qir Ei Qjs 



(21) 

by noting that 

$$p(i \mid r) = \frac{q_{ir}}{\sum_j q_{jr}} \qquad (22)$$



is the posterior probability that given that a node has 
been assigned to class r, the node is i. The second term 
on the right hand side of Eq. (21) must be included as 
a correction for the absence of self-connections, since by 
convention, we assume that $A_{ii} = 0$ for all $i$. 
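
A possible numerical estimate of the class connectivity is sketched below. It is not a transcription of Eq. (21): it assumes the simplest way of accounting for the absence of self-connections, namely averaging $A_{ij}$ over ordered pairs $i \neq j$ with weights $q_{ir} q_{js}$. The function name is hypothetical.

```python
import numpy as np

def class_connectivity(A, q):
    """Weighted fraction of ordered pairs (i != j) in classes (r, s) that are linked."""
    A = np.asarray(A, dtype=float)
    q = np.asarray(q, dtype=float)
    links = q.T @ A @ q                       # sum_ij q_ir A_ij q_js
    size = q.sum(axis=0)
    # sum over i != j of q_ir q_js; this is zero (division by zero) for a
    # class that consists of a single sharply assigned node when r == s.
    pairs = np.outer(size, size) - q.T @ q
    return links / pairs

# Ideal content-based network with a sharp classification.
x = np.array([0, 0, 1, 1, 1])
c = np.array([[1, 1],
              [0, 0]])
A = c[np.ix_(x, x)].astype(float)
np.fill_diagonal(A, 0)
c_bar = class_connectivity(A, np.eye(2)[x])
```

For a sharp classification of an ideal content-based network this estimate returns the generating connectivity function, including its diagonal entries.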

$\bar{c}(r, s)$, as defined above, is the probability of regarding 
a connection between two nodes in the graph as being one 
between nodes of type $r$ and $s$. As we will show in the fol- 
lowing section, if the underlying graph is a contents-based 
network, a successful application of the EM algorithm 
should result in sharp assignments of nodes into classes 
and $\bar{c}(r, s)$ should thus be binary valued (and moreover 
be equal to the connectivity function $c(r, s)$). It is also possi- 
ble to define a measure of how closely the connectivity 
function resembles one that corresponds to a content- 
based network by considering the entropy for $\bar{c}$, 

$$S_c = -\frac{1}{\mathcal{N}_c^2 \ln 2} \sum_{r,s} \Big[ \bar{c}(r, s) \ln \bar{c}(r, s) + \big(1 - \bar{c}(r, s)\big) \ln \big(1 - \bar{c}(r, s)\big) \Big]. \qquad (23)$$



We have that $0 \le S_c \le 1$. The maximum of $S_c$ occurs 
when $\bar{c}(r, s) = 1/2$, i.e. when none of the classes have 
any preferred connection pattern to any class. 
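
The connectivity entropy can be evaluated in the same way as $S_q$. The sketch assumes the binary-entropy form normalized by $\mathcal{N}_c^2 \ln 2$, which is the form compatible with the bounds and the position of the maximum quoted above; the function name is an assumption.

```python
import numpy as np

def connectivity_entropy(c_bar):
    """Binary entropy of the class connectivity, normalized to lie in [0, 1]."""
    c = np.asarray(c_bar, dtype=float)

    def xlogx(p):
        # 0 * ln 0 is taken as 0, its limiting value.
        return np.where(p > 0, p * np.log(np.where(p > 0, p, 1.0)), 0.0)

    n_c = c.shape[0]
    return -(xlogx(c) + xlogx(1 - c)).sum() / (n_c**2 * np.log(2))

s_binary = connectivity_entropy([[0, 1], [1, 0]])   # content-based limit
s_random = connectivity_entropy(np.full((3, 3), 0.5))
```

A binary-valued connectivity gives zero, while a connectivity equal to 1/2 everywhere gives the maximum value of one.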



The generalization of the EM method, the entropies 
$S_q$ and $S_c$, and the estimation of $\bar{c}(r, s)$ are in general appli- 
cable to any kind of graph. However, for the purpose 
of this article we will focus only on their applications 
to content-based networks. We will address the general 
case in a subsequent work [39], where we will also show 
that contents-based networks play a special role for the 
classifications of the EM method. 



V. ANALYTICAL RESULTS FOR 
CONTENTS-BASED NETWORKS 

Assume that we are given a contents-based graph $\mathcal{G}$ 
that has been constructed from a set of nodes of unknown 
contents, and an unknown connectivity function $c(x, y)$. 
In this setting, we suppose that the optimal number of 
classes $\mathcal{N}_c$ has already been found and that it is equal 
to the number of contents $\mathcal{N}_x$. We would like to know 
under which conditions the EM algorithm can infer the 
class membership of the nodes as well as the connectivity 
function. In other words, given the adjacency matrix 
$A_{ij}$, we are looking for a solution of the generalized EM 
equations, Eqs. (18) and (19), with 

$$q_{ir} = \delta_{x_i, r}, \quad \text{with } x_i \in X, \qquad (24)$$

along with the unknown class-connectivity function 
$\bar{c}(r, s)$ that ideally should coincide with the original 
$c(x, y)$. Note that the Ansatz Eq. (24) implies that for 
such a solution $S_q = 0$. 

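Substituting the above Ansatz into Eq. (18), we find 

$$\overleftarrow{\theta}_{rj} = \frac{c(x_j, r)\,[1 - c(r, x_j)]}{k_r^{\mathrm{i}} + k_r^{\mathrm{o}} - k_r^{\mathrm{b}}},$$

$$\overrightarrow{\theta}_{rj} = \frac{c(r, x_j)\,[1 - c(x_j, r)]}{k_r^{\mathrm{i}} + k_r^{\mathrm{o}} - k_r^{\mathrm{b}}},$$

$$\overleftrightarrow{\theta}_{rj} = \frac{c(r, x_j)\, c(x_j, r)}{k_r^{\mathrm{i}} + k_r^{\mathrm{o}} - k_r^{\mathrm{b}}}, \qquad (25)$$

where $k_r^{\mathrm{i}}$, $k_r^{\mathrm{o}}$ and $k_r^{\mathrm{b}}$ are the average in-degree, out-degree 
and bi-directional degree of nodes belonging to class $r$, 

$$\sum_i \delta_{x_i, r} \left( k_i^{\mathrm{i}} + k_i^{\mathrm{o}} - k_i^{\mathrm{b}} \right) = n_r \left( k_r^{\mathrm{i}} + k_r^{\mathrm{o}} - k_r^{\mathrm{b}} \right) \equiv n_r k_r, \qquad (26)$$

so that $k_r$ is the total degree of each of the $n_r$ nodes 
belonging to class $r$. Note that in Eq. (25), the node 
index $j$ enters only through its content $x_j$, so that $\theta_{rj}$ is 
the same for all the nodes that have the same content as 
$j$. The same turns out to be true for the $q_{ir}$. We thus 
have $q_{ir} = q_{tr}$ for all nodes $i$ such that $x_i = t$, and from 
Eq. (19) we obtain 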



$$q_{tr} = \frac{\pi_r}{\gamma_t\, k_r^{k_t}} \prod_s \Big\{ [c(r,s)(1 - c(s,r))]^{c(t,s)(1 - c(s,t))} \times [c(s,r)(1 - c(r,s))]^{c(s,t)(1 - c(t,s))} \times [c(r,s)\, c(s,r)]^{c(t,s)\, c(s,t)} \Big\}^{n_s}, \qquad (27)$$

where $\gamma_t$ is the normalization constant for $q_{tr}$. 

We now have to consider the conditions on 
$c(r, s)$, $c(s, r)$, $c(t, s)$, and $c(s, t)$ such that, given the 
classes $r$ and $t$, the terms in the product on the right 
hand side of Eq. (27) are non-zero for all $s$ when $r = t$, 
and zero for at least one $s$ when $r \neq t$. This is a state- 
ment about the kind of connections that the nodes of 
type $r$ and $t$ make to or receive from nodes of all possi- 
ble classes $s$. An inspection of the $c^{c}$-type terms in the 
product shows that their contribution to $q_{tr}$ is non-zero 
if and only if the following two conditions are satisfied 
for all $s$: 

• If there is a connection between t and s, there must 
be also a connection between r and s of the same 
kind, namely either in, out, or bi-directional. 

• Whenever there is no connection between t and s, 
there can be any kind of connection between r and 
s, as well as none at all. 

The satisfaction of both conditions can be regarded as 
constituting a cover type of relation between $r$ and $t$, i.e. 
nodes belonging to class $r$ connect in the same way with 
all the classes that nodes belonging to class $t$ connect to, but 
they also have some extra connections. We will denote 
this relation by $r \succ t$ and say that $r$ covers $t$. From its 
definition it is clear that the cover relation is transitive: 
$r \succ t$, $t \succ s \Rightarrow r \succ s$. When $r \succ t$, we also define $\mathcal{E}(r; t)$ 
as the set of extra classes that $r$ connects to (or receives 
connections from) relative to those of $t$. 
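
The two conditions above translate directly into a test on the connectivity function. The sketch below is one possible reading: it assumes a binary $c$ stored as a nested list with 0-based class labels, and it requires at least one extra connection for $r$ to cover $t$. The connectivity used in the example is hypothetical and is not the one of Figure 3.

```python
def link_type(c, a, b):
    """Connection between classes a and b as seen from a: none, out, in or bi."""
    kinds = {(0, 0): "none", (1, 0): "out", (0, 1): "in", (1, 1): "bi"}
    return kinds[(int(c[a][b]), int(c[b][a]))]

def extra_classes(c, r, t):
    """Return the set E(r; t) if r covers t, otherwise None."""
    extra = set()
    for s in range(len(c)):
        kind_t, kind_r = link_type(c, t, s), link_type(c, r, s)
        if kind_t == "none":
            if kind_r != "none":
                extra.add(s)      # r has a connection that t lacks
        elif kind_r != kind_t:
            return None           # r fails to reproduce a connection of t
    return extra if (r != t and extra) else None

# Hypothetical example with nested covers: classes 0, 1, 2 point to class 5;
# class 1 also points to class 3, and class 2 points to classes 3 and 4.
c = [[0] * 6 for _ in range(6)]
c[0][5] = c[1][5] = c[2][5] = 1
c[1][3] = c[2][3] = 1
c[2][4] = 1
```

Here `extra_classes(c, 2, 0)` is the union of `extra_classes(c, 2, 1)` and `extra_classes(c, 1, 0)`, as expected from transitivity, while `extra_classes(c, 0, 1)` returns `None`.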

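With the above definition, it can be readily seen that 
when $r \succ t$, 

$$k_r = k_t + \sum_{v \in \mathcal{E}(r;t)} n_v, \qquad (28)$$

where the index $v$ runs over the extra classes to which $r$ 
is connected. This implies that 

$$k_r^{-k_t} = k_t^{-k_t} \Big[ 1 + \frac{1}{k_t} \sum_{v \in \mathcal{E}(r;t)} n_v \Big]^{-k_t}. \qquad (29)$$

Thus we find that 

$$q_{tr} = \frac{1}{\gamma_t} \times \begin{cases} \pi_t\, k_t^{-k_t}, & r = t, \\ \pi_r\, k_t^{-k_t} \Big[ 1 + \frac{1}{k_t} \sum_{v \in \mathcal{E}(r;t)} n_v \Big]^{-k_t}, & r \succ t, \\ 0, & \text{o/w} \end{cases} \qquad (30)$$

(with $\mathcal{E}(t; t) = \emptyset$). Thus, when $r \succ t$ and for large $k_t$, $q_{tr}$ 
deviates from our Ansatz, Eq. (24), by an exponentially 
small amount. 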

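Treating the deviations caused by the presence of cover 
relations among the classes as a small perturbation to 
our Ansatz, Eq. (24), we can obtain the leading order 
expression for $q_{tr}$ as 

$$q_{tr} = \begin{cases} 1 - \sum_{r' \succ t} \frac{\pi_{r'}}{\pi_t} \Big[ 1 + \frac{1}{k_t} \sum_{v \in \mathcal{E}(r';t)} n_v \Big]^{-k_t}, & r = t, \\ \frac{\pi_r}{\pi_t} \Big[ 1 + \frac{1}{k_t} \sum_{v \in \mathcal{E}(r;t)} n_v \Big]^{-k_t}, & r \succ t, \\ 0, & \text{o/w}, \end{cases} \qquad (31)$$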



FIG. 3: Connectivity function $c(x, y)$ for the theoretical 
example of Section V-A. The number of contents is six: 
A, B, C, D, E and F. The points represent the ones in the 
connectivity matrix; the values not marked are zero. 



where $\gamma_t$ has been determined from the normalization 

$$\sum_r q_{tr} = 1. \qquad (32)$$

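To the same order, we find also that 

$$\pi_r = \frac{n_r}{N} - \sum_{t \succ r} \frac{n_t}{N} \Big[ 1 + \frac{1}{k_r} \sum_{v \in \mathcal{E}(t;r)} n_v \Big]^{-k_r} + \frac{n_r}{N} \sum_{t \prec r} \Big[ 1 + \frac{1}{k_t} \sum_{v \in \mathcal{E}(r;t)} n_v \Big]^{-k_t}. \qquad (33)$$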



Equations (31) and (33) are the analytical solution of the 
EM equations for a content-based network with connec- 
tivity function $c(r, s)$. 

We see that whenever a class $r \succ t$, there is a non-zero 
probability for a node of class $t$ to be also classified as belonging 
to class $r$. We will refer to this as a leakage in the class 
assignment. However, as can be seen from Eq. (31), the 
leakage probabilities vanish exponentially with the size of 
the classes. A detailed account of the solution structure 
for contents-based networks as well as more general types 
of networks will be given elsewhere [39]. 

When the contents-based network is cover-free, the 
generalized EM equations have a leak-free solution and 
thus the entropy of the class assignments $S_q$ vanishes. 
On the other hand, in the presence of cover relations, the 
EM method will produce assignments with some nodes in 
multiple classes, i.e. leaks. We have already found above 
the leading order behavior for the leakage. It is not too 
hard to show that, in that case, $S_q$ is given by 

$$S_q \simeq \frac{1}{N} \sum_{t\ \text{has a cover}}\ \sum_{r \succ t} n_r\, \alpha(r; t) \Big[ 1 + \frac{\alpha(r; t)}{k_t} \Big]^{-k_t}, \qquad (34)$$



where $\alpha(r; t) = \sum_{v \in \mathcal{E}(r;t)} n_v$ is the number of nodes to 
which nodes in class $r$ are connected in addition to those 
that nodes in class $t$ connect to. In many practical situa- 
tions, the number of contents is fixed. This implies that 
if the probability of being in class $r$ is given by $p_r$, the 
actual number of nodes in class $r$ will grow on av- 
erage as $n_r = N p_r$ with the system size. Therefore, the 
factor $\alpha(r; t)$ of Eq. (34) can also be written as 

$$\alpha(r; t) = N\, a(r; t), \qquad (35)$$

where $a(r; t)$ is a constant depending on the connectiv- 
ity function that generated the network. Under these 
assumptions, the entropy $S_q$ will decrease exponentially 
with the network size, meaning that even for moderately 
sized networks the leakages will be in general too small 
to cause significant misclassification. 
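
The exponential decay can be made concrete with a short numerical check. The sketch assumes that the leakage of class $t$ into a covering class $r$ is controlled by a factor of the form $[1 + \alpha(r;t)/k_t]^{-k_t}$, with both $\alpha(r;t)$ and $k_t$ growing linearly with $N$; the fractions `a` and `kappa` are hypothetical values chosen for illustration.

```python
def leak_factor(alpha, k_t):
    """[1 + alpha / k_t]^(-k_t), the factor assumed to control the leakage."""
    return (1.0 + alpha / k_t) ** (-k_t)

# Hypothetical fractions: the extra classes hold 20% of the nodes, and nodes
# of class t are linked to 40% of the network.
a, kappa = 0.2, 0.4
factors = [leak_factor(a * N, kappa * N) for N in (50, 100, 200)]
```

Since the factor equals $(1 + a/\kappa)^{-\kappa N}$, doubling the network size squares it, so the leakage drops by several orders of magnitude between $N = 50$ and $N = 100$.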

As shown in Section IV, the solution of the EM equa- 
tions provides us with an estimate for the class connec- 
tivity, $\bar{c}(r, s)$, given by Eq. (21). For contents-based net- 
works in the absence of any cover relation among classes, 
we have, cf. Eq. (22), $p(i \mid r) = \delta_{x_i, r}/n_r$, and from 
Eq. (21) we immediately find that $\bar{c}(r, s) = c(r, s)$ with 
$S_c = 0$. In the presence of cover relations among the 
classes, there will be corrections that vanish exponen- 
tially with the number of nodes in the relevant classes. 
These results demonstrate that the EM algorithm is ca- 
pable of inferring the hidden class connectivity function 
that generated the network. 



A. An Example 

In order to further illustrate the theoretical results 
above, we turn next to an example. Consider a network 
generated from six kinds of contents to be denoted by 
A, B, C, D, E and F, and with the connectivity function 
as shown in Figure 3. The following cover relations are 
present: $B \succ A \succ F$; that is, $B \succ A$, $B \succ F$, and $A \succ F$. 
In fact, we have chosen this particular example to eluci- 
date the effect of having nested covers and to show that 
the cover relation is transitive. For each of the cover re- 
lations, the sets of connections to additional classes are 
$\mathcal{E}(B; A) = \{D\}$, $\mathcal{E}(B; F) = \{D, C\}$ and $\mathcal{E}(A; F) = \{C\}$. 
When inserted into Eq. (31), these relations yield 
with $q_{BB} = q_{CC} = q_{DD} = q_{EE} = 1$ and all the other 
values of $q_{rt} = 0$. These results are in agreement with 
what one would expect intuitively. For example, since 
$B \succ A$ and $B \succ F$, there is a non-zero probability of 
mistaking nodes of type $A$ or $F$ for nodes of type $B$, i.e. $q_{AB}$, 
$q_{FB}$, and $q_{FA}$ are all non-zero. However this probability 
vanishes exponentially with the number of nodes in the 
classes E and C. In the large network size limit, the 
leakage on $q_{ir}$, and how far $S_q$ deviates from zero, are 
determined by the pairs of classes $(r, t)$ such that $r$ is the 
"tightest" cover of $t$; these are the pairs $r \succ t$ for which 
$\alpha(r; t)$ is smallest, $A \succ F$ and $B \succ A$ in our example. 
FIG. 4: Connectivity functions $c(x, y)$ for the two examples of 
content-based networks analyzed in the simulation sections. 
The number of contents considered is five: A, B, C, D and E. 
The contents of the connectivity function A) display no cover 
relation, while in the second example, B), $A \succ B$. The net- 
works are generated assuming equal probability for the five 
contents when assigning a content to each node. 
FIG. 5: An example of classification. The original network is 
on the top, and on the bottom the probability $q_{ir}$ is repre- 
sented graphically. The colors of the symbols correspond to 
the contents of the nodes (green A, red B, magenta C, blue 
D and cyan E). On the bottom, the radius of each sphere is pro- 
portional to the probability $q_{ir}$. On the left, the network is 
generated using the connectivity function $c_A$ of Figure 4 with 
no cover relation among the classes, while on the right we have 
used $c_B$, which incorporates a single cover relation between A 
and B such that $A \succ B$. 
FIG. 6: In the lower panels, $S_q$ (circles) and its fluctuations $\sigma_S$ 
(squares) as a function of $\mathcal{N}_c$ for the networks shown in Figure 
5. In order to facilitate visualization, the insets show the same 
curves in a semi-logarithmic plot. The top panels display 
the same quantities, $S_q$ and $\sigma_S$, but ensemble-averaged over 
different realizations of the content-based networks generated 
with the connectivity functions of Figure 4. 
FIG. 7: On the top, the connectivity function $\bar{c}(r, s)$ obtained 
from the EM classification of the networks displayed in Figure 
5. The radius of each circle is proportional to the value of 
$\bar{c}(r, s)$. On the bottom, we show how $S_c$ varies with 
$\mathcal{N}_c$ for the same networks as well as, in the inset, an average 
over different content-based realizations generated with the 
connectivity functions of Fig. 4. 



VI. SIMULATION RESULTS, EM APPLIED TO 
CONTENT-BASED NETWORKS 

In the following, we study numerically the ideas introduced in the
previous sections. The generalized version of EM will be applied to
directed content-based networks generated randomly from the connectivity
functions shown in Figure 4. Each node of these networks is assigned a
content selected at random out of $\mathcal{N}_x = 5$ possible contents,
denoted by A, B, C, D and E. Since the presence of cover relations can
change the quality of an EM classification, we have considered two
connectivity functions $c(x,y)$ (see Fig. 4): one without class coverage,
$c_A$, and another, $c_B$, with a single cover relation between contents
A and B, such that $A \succ B$. In order to improve our numerical
estimation of the classification with maximum likelihood, we implemented
a simulated-annealing type of procedure for the optimization of
$\mathcal{L}$.
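
The generation step described above can be sketched as follows. This is a
minimal illustration under stated assumptions, not the code used for the
simulations: the function name `generate_content_network` is hypothetical,
self-loops are excluded by assumption, and `c_example` is an arbitrary
community-like 0/1 pattern rather than either connectivity function of
Figure 4.

```python
import random


def generate_content_network(n_nodes, c, rng):
    """Return (contents, edges) for a directed content-based network.

    c[x][y] is 1 if a node with content index x links to a node with
    content index y, and 0 otherwise (hypothetical encoding).
    """
    # Each node draws one of the contents with equal probability.
    contents = [rng.randrange(len(c)) for _ in range(n_nodes)]
    edges = set()
    for i in range(n_nodes):
        for j in range(n_nodes):
            # Assumption: self-loops are not generated.
            if i != j and c[contents[i]][contents[j]] == 1:
                edges.add((i, j))
    return contents, edges


# Arbitrary illustrative pattern: nodes link only within their own content.
c_example = [[1 if x == y else 0 for y in range(5)] for x in range(5)]

contents, edges = generate_content_network(50, c_example, random.Random(0))
```

Because the links are fully determined by the contents, two nodes with the
same content end up with identical connection patterns towards every other
node, which is the structural equivalence the classification relies on.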

In the previous section, we have shown that our generalized EM method is
able to infer the underlying content-based structure that generated the
network. These calculations were carried out assuming that the number of
contents $\mathcal{N}_x$ coincides with the number of classes
$\mathcal{N}_c$. Let us therefore start by setting
$\mathcal{N}_c = \mathcal{N}_x = 5$. In Figure 5, we show graphically the
classifications obtained from the generalized EM method as applied to an
ensemble of networks of size $N = 50$ generated with the connectivity
functions of Figure 4. The color coding is based on the contents of the
nodes and is the same in all the subsequent figures of the paper (A
green, B red, C magenta, D blue and E cyan, as in the caption of Fig. 5).
The sizes of the spheres in the bottom plots are proportional to the
probabilities $q_{ir}$.



For these examples the classification is rather good even when a cover
relation is present, as can be readily seen from the bottom diagrams,
where no major color is misplaced. In other words, there are no
misclassifications, although for the B case a small amount of leakage
can be noticed.

To quantify the quality of these results, we can, as a first measure,
count the number of network realizations in our ensemble for which at
least two nodes with different contents have been assigned to the same
class, with the understanding that a node $i$ is assigned to a class $r$
whenever $q_{ir} > 1/2$. This is a strict criterion, since a
classification with only a single misclassified node is already counted
as erroneous. The result can also depend slightly on the method applied
to optimize the likelihood. Still, this definition is a safe choice that
keeps the detection of mistakes in the classification simple. Let us
call this the error rate of the classification, $\epsilon$. For each of
the two connectivity functions of Fig. 4, we have studied over 2000
realizations of networks of size $N = 50$. In none of them did the
generalized EM method misclassify a single node. This result is in
agreement with our earlier observation that the EM method classifies
structurally equivalent nodes in the same way.
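
The criterion just described can be written down compactly. The sketch
below is only an illustration: the function names are hypothetical, the
error rate is normalized as a fraction of realizations (an assumption,
since the text speaks of counting realizations), and nodes for which no
class reaches $q_{ir} > 1/2$ are simply ignored (also an assumption).

```python
def assigned_class(q_i):
    """Return the class r with q_i[r] > 1/2, or None if no class qualifies."""
    for r, prob in enumerate(q_i):
        if prob > 0.5:
            return r
    return None


def is_misclassified(q, contents):
    """True if two nodes with different contents share an assigned class."""
    content_of_class = {}
    for q_i, x in zip(q, contents):
        r = assigned_class(q_i)
        if r is None:
            continue  # assumption: unassigned nodes are ignored
        # Remember the first content seen in class r and compare against it.
        if content_of_class.setdefault(r, x) != x:
            return True
    return False


def error_rate(realizations):
    """Fraction of (q, contents) realizations flagged as misclassified."""
    flagged = sum(is_misclassified(q, contents) for q, contents in realizations)
    return flagged / len(realizations)
```

Here `q` is a list of per-node probability rows and `contents` the list of
true content labels; a single flagged pair of nodes is enough to mark the
whole realization as erroneous, which is what makes the criterion strict.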

The next question is then: how can the optimal $\mathcal{N}_c$ be
determined? If the networks studied are content-based, there are several
possible answers to this. Here we will outline two of them, and at the
end of this section we will discuss a third one in the context of
inferring the class connectivity function. In Section IV, we introduced
a measure $S_q$ for the quality of an EM classification of the network.
We have also shown that when $\mathcal{N}_x = \mathcal{N}_c$, $S_q$ is
either zero or exponentially small for large content-based networks.
Therefore, a signal in $S_q$ can be expected at
$\mathcal{N}_c = \mathcal{N}_x$ if the EM algorithm is faced with the
challenge of classifying a content-based network with a series of values
of $\mathcal{N}_c$. This effect happens because the normalization
conditions of Eqs. © and Q13p impose that no class can be left totally
unassigned, $\pi_r > 0$ for all $r$. The more redundant classes the
method has to assign nodes to, the higher $S_q$ becomes. In other words,
we are providing the EM algorithm with more degrees of freedom than
required to properly classify the nodes. The extra freedom leads to
structural leakage. The evolution of $S_q$ with $\mathcal{N}_c$ is
displayed in Figure 6 for the two networks of Fig. 5. These are, of
course, particular examples, but some general features can be deduced.
First, the value of $S_q$ is rather small or even zero for
$\mathcal{N}_c < \mathcal{N}_x$. This may be a generic property of
content-based networks. As noted before, the structural equivalence of
nodes with the same content prevents the EM algorithm from putting such
nodes into different classes. This means that once the contents are
sorted into classes, the leakage comes from cover relations between
classes and can become very small for big networks. On the other hand,
when $\mathcal{N}_c > \mathcal{N}_x$, the availability of excess classes
that cannot be left totally unassigned causes $S_q$ to be non-zero and
to increase steadily with $\mathcal{N}_c$. The boundary between these
two types of behavior is precisely the unknown
$\mathcal{N}_c = \mathcal{N}_x$.

Another peculiarity of the EM method applied to content-based networks
is that when $\mathcal{N}_c < \mathcal{N}_x$, the landscape of the
likelihood seems to have a very clear and unique maximum. The solution
at the point of maximum $\mathcal{L}(\pi, \theta)$ also has a
well-determined value of $S_q$. However, if
$\mathcal{N}_c > \mathcal{N}_x$, the landscape of the likelihood becomes
rough, with a large number of local maxima. The search for the global
maximum under such conditions is therefore much harder. Even in the
cases where it can be numerically found, say when
$\mathcal{N}_c = \mathcal{N}_x$, it is formed by a set of degenerate
extrema with the same value of $\mathcal{L}$ but very different values
of $S_q$. Indeed, the values of the entropy shown in Fig. 6 for
$\mathcal{N}_c > \mathcal{N}_x$ are averages over the best likelihood
solutions found in different realizations of the optimization method,
along with their standard deviations $\sigma_S$. The dispersion
$\sigma_S$ of $S_q$ around its average can be used in practice as
another estimator for the optimal number of classes (see Fig. 6).
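
One way to turn these two signals into a practical rule is sketched
below. It assumes that the values of $S_q$ have already been computed for
several independent optimization runs at each number of classes; the
function names, the threshold `tol`, and the decision rule itself (take
the largest number of classes for which both the average of $S_q$ and its
dispersion stay below `tol`) are illustrative assumptions, not a
procedure specified in the text.

```python
from statistics import mean, pstdev


def summarize_entropy(samples_by_nc):
    """Map each number of classes to (mean S_q, dispersion sigma_S)."""
    return {nc: (mean(s), pstdev(s)) for nc, s in samples_by_nc.items()}


def estimate_num_classes(samples_by_nc, tol=1e-3):
    """Largest number of classes whose S_q mean and dispersion are below tol.

    Returns None when no candidate satisfies the (hypothetical) threshold.
    """
    summary = summarize_entropy(samples_by_nc)
    sharp = [
        nc for nc, (avg, sigma) in summary.items()
        if avg < tol and sigma < tol
    ]
    return max(sharp) if sharp else None
```

The appropriate value of `tol` would depend on the network size, since the
leakage caused by cover relations is small but not necessarily zero for
finite networks.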

Once $\mathcal{N}_x$ is known, it is possible to recover $c(r,s)$ as
explained in Section IV. In the top panels of Figure 7, the recovered
$c(r,s)$ is displayed for the content-based networks of Figure 5. After
the classes of $c(r,s)$ have been properly reordered, the top panels of
Fig. 7 are indistinguishable from the connectivity functions given in
Fig. 4. Also, in the lower panels of Figure 7, we have included the
evolution of the entropy $S_c$ as a function of $\mathcal{N}_c$. $S_c$
also shows a clear change of behavior at $\mathcal{N}_c = \mathcal{N}_x$,
suggesting that the best content-based partition of the network happens
when the number of classes equals the number of contents. Consequently,
$S_c$, apart from being an estimator of how much a network deviates from
a purely content-based graph, is also a useful quantity for deciding
when $\mathcal{N}_c$ is optimal.
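
Since the class labels returned by the classification are arbitrary, the
recovered $c(r,s)$ has to be relabeled before it can be compared with a
known connectivity function. A brute-force sketch of such a reordering is
given below; the function name and the use of a summed absolute difference
as the mismatch measure are assumptions made for illustration. Trying all
permutations is feasible for five classes (120 permutations) but grows
factorially with the number of classes.

```python
from itertools import permutations


def best_relabeling(c_recovered, c_true):
    """Find the class permutation that best aligns c_recovered with c_true.

    Returns (perm, mismatch), where perm[r] is the content index matched
    to class r and mismatch is the summed absolute difference.
    """
    n = len(c_true)
    best_perm, best_cost = None, float("inf")
    for perm in permutations(range(n)):
        cost = sum(
            abs(c_recovered[r][s] - c_true[perm[r]][perm[s]])
            for r in range(n)
            for s in range(n)
        )
        if cost < best_cost:
            best_perm, best_cost = perm, cost
    return best_perm, best_cost
```

A mismatch of zero means that the recovered function coincides with the
reference one up to a relabeling of the classes.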

VII. EM AND NOISY CONNECTIONS IN 
CONTENT-BASED NETWORKS 

It is unlikely that the generating processes of real-world networks are
error-free. Even if the underlying structure is expected to be a
content-based network, errors in the connection pattern could naturally
arise. We try to mimic unexpected connections, as well as the absence of
expected connections, by introducing the corresponding error
probabilities into the process of network generation from its contents.
As before, each node $i$ has a content $x_i$ assigned at random from the
set of possible contents (in the case of our example networks, the same
five possibilities: A, B, C, D and E). Once the contents are
established, the structure of the content-based network should be
determined completely by the connectivity function $c(x_i, x_j)$: if
$c(x_i, x_j) = 1$, there ought to be a link from node $i$ to $j$, and
none if $c(x_i, x_j) = 0$. As a way of gradually loosening the
content-based structure of the connections, we now introduce the
probabilities $P_\mu$ and $P_a$ of not having a link when
$c(x_i, x_j) = 1$ and of having a link although $c(x_i, x_j) = 0$,
respectively. The
networks constructed in this way can be regarded as hid- 
den variable graphs [HI, [U, H|| for which the probability 
of connection between any nodes i and j is expressed as 

$$r(x_i, x_j) = c(x_i, x_j)\,(1 - P_\mu) + [1 - c(x_i, x_j)]\,P_a . \qquad (37)$$

In other words, where in the absence of noise the probability of having
a connection was one, it is now $1 - P_\mu$, and likewise, where it was
zero, it is now $P_a$. The extreme limit of this model occurs when
$P_\mu = P_a = 1/2$, so that the probability of connecting to a node of
another class is maximally random and independent of the connectivity
function. We are more interested here in the limit when both $P_a$ and
$P_\mu$ are much smaller than 1/2, and the resulting graphs can be seen
as a slight modification of a content-based network. For the sake of
simplicity, all of the results shown below are for $P_a = P_\mu$.

Let us begin by looking at how the networks change with increasing
assignment error. In the top panels of Figure 8, we display a series of
networks generated with the connectivity function $c_A$ for different
values of $P_\mu = P_a$. It is readily seen that the connection patterns
associated with the different kinds of content become more and more
diffuse. In the bottom panels of the same figure, we show the
corresponding class assignment probabilities $q_{ir}$. While these are
just examples, some features are worth pointing out. The problems in the
classification seem to appear somewhere between $P_a = P_\mu = 1\%$ and
$P_a = P_\mu = 10\%$. Even at 10% error, the number of misclassified
nodes in these networks is not very high. A closer inspection of the
solution found shows that actually only two of the node content classes
FIG. 8: Same network as in Section A of Fig. 5 but with increasing error
probability $P_a = P_\mu$. The values of $P_a$ are, from left to right,
0.001, 0.01 and 0.1. The plots in the lower panel are a graphic
representation of the probability $q_{ir}$ of classifying node $i$ in
class $r$. As before, the radius of each sphere is proportional to
$q_{ir}$ and the colors correspond to the actual content of the nodes
(green A, red B, magenta C, blue D and cyan E).



are mixed up, while all the remaining node classes are perfectly
assigned. With the aim of quantifying these observations, the behavior
of $\epsilon$ is plotted in Figure 9 versus the disorder probability.
This plot is, of course, susceptible to slight changes depending on the
method used to search for the maximum likelihood, and it depends on how
many realizations of the content-based graphs were considered (in this
case 1000). Nevertheless, in our simulations the threshold for a sharp
classification of all the nodes of the network is around
$P_a = P_\mu \approx 5\%$ for graphs without coverage, connectivity
function $c_A$, and much
FIG. 9: The error rate $\epsilon$ as a function of the probabilities
$P_a = P_\mu$ for content-based networks generated with the connectivity
functions of Figure 4 and with $\mathcal{N}_c = \mathcal{N}_x = 5$.



FIG. 10: The average entropies over different realizations for
content-based networks generated with the connectivity functions of
Figure 4. In the top panels, $S_c$ is represented as a function of the
number of classes $\mathcal{N}_c$ for two different levels of disorder:
the circles are for $P_a = P_\mu = 1\%$, while the triangles are for
$P_a = P_\mu = 10\%$. In the bottom panels, $S_q$ and $\sigma_S$ versus
$\mathcal{N}_c$ for the disorder probabilities $P_a = P_\mu = 1\%$,
circles ($S_q$) and squares ($\sigma_S$), and $P_a = P_\mu = 10\%$,
triangles ($S_q$) and stars ($\sigma_S$).



the order of magnitude of the threshold beyond which the 
content-based structure cannot be recovered anymore. 

A next aspect to consider is how the entropies $S_q$ and $S_c$ are
affected by the intensity of the disorder, and whether they are still
valid estimators to determine the optimal number of classes. To answer
this question, we fix the probabilities $P_\mu = P_a$ at 1%, which seems
to be a value where one might plausibly expect to obtain good
classifications for both types of networks. In Figure 10, we display
$S_q$, $\sigma_S$ and $S_c$ as functions of the number of classes
$\mathcal{N}_c$, with the results averaged over different content-based
realizations. Indeed, at this level of disorder the entropies can still
be used to estimate $\mathcal{N}_x$. The noise in the connections
introduces a small constant background for $S_c$, which we will denote
by $S^*$, and which can be determined in both examples from the behavior
at high values of $\mathcal{N}_c$. We can estimate the value of $S^*$ by
noting that when $\mathcal{N}_c = \mathcal{N}_x$ any non-zero entropy
should essentially be due to the background from the random connections.
Substituting the expression for $r(x_i, x_j)$, Eq. (37), into the
definition of $S_c$, Eq. (23), should therefore give us an estimate for
$S^*$,

$$S^* \simeq -\frac{2}{\mathcal{N}_c^2 \ln 2} \sum_{x,y} r(x,y)\, \ln r(x,y) \qquad (38)$$

For $P_a = P_\mu = 1\%$, this yields $S^* \simeq 0.112$, close to the
value observed in Figure 10 for $\mathcal{N}_c > 5$. To check how well
our estimate for $S^*$ agrees with the values obtained from simulations,
we plot in Figure 11 $S_c$ vs. the disorder probability at
$\mathcal{N}_c = 5$. When the disorder becomes very strong, on the other
hand, it might not be possible to find an optimal $\mathcal{N}_c$.
Moreover, the presence of very different connection patterns for nodes
with the same content renders the existence of such an optimal number
dubious. Therefore, apart from the obvious classification
$\mathcal{N}_c = N$, there may not be any other sharp classification.
The effects of high disorder can be seen in Figure 10, where the
entropies $S_c$ and $S_q$ are represented as a function of
$\mathcal{N}_c$ for $P_a = P_\mu = 10\%$. The results depend on the
connectivity function: $c_A$ seems a little more robust to the disorder,
as was confirmed by Fig. 9, but the signal in $S_q$ or $\sigma_S$ is
clearly lost or has moved to higher values of $\mathcal{N}_c$. Also,
$S_c$ has lost its capacity to predict $\mathcal{N}_x$ and falls
smoothly for higher and higher values of $\mathcal{N}_c$. It is also
worth noting that, in spite of the lack of a method to find
$\mathcal{N}_x$, if $\mathcal{N}_c = 5$ the EM method retrieves the
appropriate hidden-variable connectivity function $r(x,y)$, as can
FIG. 11: The average entropy $S_c$ as a function of the disorder
probability $P_a = P_\mu$ for content-based networks generated with the
connectivity functions $c_A$ and $c_B$ depicted in Figure 4. The red
curves correspond to the value of $S^*$.



be inferred from the good fit produced by Eq. (38) to $S_c$ shown in
Figure 11.
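
As a rough numerical check of the background level quoted above, the
sketch below evaluates $S^*$ under the assumption that it takes the form
$-\frac{2}{\mathcal{N}_c^2 \ln 2} \sum_{x,y} r(x,y) \ln r(x,y)$, with
$r(x,y)$ given by Eq. (37). The function name is hypothetical, and
`c_example` is an illustrative 5x5 pattern with five unit entries chosen
for this example; it is not taken from Figure 4.

```python
import math


def background_entropy(c, p_mu, p_a):
    """S* under the assumed form -(2 / (N_c^2 ln 2)) * sum_xy r ln r."""
    n_c = len(c)
    total = 0.0
    for row in c:
        for c_xy in row:
            r = c_xy * (1.0 - p_mu) + (1 - c_xy) * p_a  # Eq. (37)
            if r > 0.0:  # r ln r tends to 0 as r tends to 0
                total += r * math.log(r)
    return -2.0 * total / (n_c ** 2 * math.log(2))


# Hypothetical pattern with five unit entries (not taken from Figure 4).
c_example = [[1 if x == y else 0 for y in range(5)] for x in range(5)]
s_star = background_entropy(c_example, 0.01, 0.01)
```

For this particular pattern and $P_a = P_\mu = 1\%$, the expression
evaluates to approximately 0.112. The result depends on how many entries
of the connectivity function are equal to one, so a different pattern
gives a different background.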

The numerical findings of this section show that the classifications of
the EM method are robust to the introduction of noise in the connection
patterns up to a certain point. The certainty of the classification
decreases as the disorder becomes stronger. In fact, this is one of the
major merits of the EM method: it is able to extract the underlying
content-based structure even in the presence of a certain level of noisy
connections.



VIII. CONCLUSIONS 

In summary, we have shown how the EM method for 
the classification of nodes of a network can be applied 
to content-based networks in order to extract the under- 
lying content-based structure even in the presence of a 
certain level of disorder in the connections. The appli- 
cation of the EM method to content-based networks is 
a natural concept that follows from the observation that 
the EM method classifies structurally equivalent nodes 
in an identical manner. In this sense, the EM method 
can be related to the Block Modeling techniques proposed
in the Social Sciences. Content-based networks, on
the other hand, are of great relevance, since they can be 
regarded as idealized paradigms of networks with com- 
munities or multipartite structures, including mixtures 
of both. Since in many realistic graphs the vertices carry 
additional attributes which might influence or even de- 
termine their connections to other vertices, being able to 
extract any content-based pattern can provide informa- 
tion about how the networks emerged. 

Our approach in this paper has been to start out with pure content-based
graphs and to show analytically as well as numerically that the EM
method can infer the content-based connectivity pattern. We have also
shown that the existence of cover relations between contents leads to
non-zero probabilities of mistaking nodes belonging to different
classes. However, these probabilities vanish exponentially as the number
of nodes increases, i.e., as more discriminating information is provided
to the method. By regarding more realistic networks as perturbations of
content-based networks under the addition or removal of connections, we
then asked under which circumstances the EM method is still able to
perform satisfactorily. There is a certain level of disorder beyond
which the inference of the content-based structure, especially the
number of contents, becomes rather hard if not impossible.

In order to estimate the quality of the classification and how far the
structure of the network is from a content-based structure, we have
introduced two entropies, $S_q$ and $S_c$, which can actually be useful
for the classification of any kind of graph, including real-world
networks. We have also shown that these entropies can be used to deduce
the optimal number of classes needed by the EM method to obtain a sharp
classification of the nodes of the network.